Today we are going to talk about data wrangling in the tidyverse, which is a collection of packages that share the same underlying “philosophy” (Data structures/grammar) used for pretty much any data science task you can think of.
Note: The tidyverse package is really a collection of a bunch of other packages. When you install/load tidyverse it installs and loads all of those packages, but you could just as well install/load them individually.
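For example, attaching a few of the member packages individually would look like this (these are some of the core tidyverse packages):

```r
# Instead of library(tidyverse), you can attach just the pieces you need
library(dplyr)    # data manipulation: filter(), select(), mutate(), ...
library(tidyr)    # reshaping: pivot_longer(), pivot_wider()
library(ggplot2)  # visualization
```

Loading the whole tidyverse is usually more convenient for interactive work, but loading individual packages keeps scripts lighter and makes dependencies explicit.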
First, we need some data. Let’s go download some and load them into R. This will be a much more common data workflow than using data from a package as we did last week. Click this link to download the timber harvest data from Oregon between 1962 and 2019. Documentation for these data is here.
Reading CSV’s is as simple as using the read_csv function (or Base R’s read.csv), but you have to tell R exactly where that file is on your computer. That is where things can get a little bit tricky, because everyone has different folder names and file structures. For example, the following code will not work on your computer, because it has the file path from my computer.
# Loading packages
library(pacman)
p_load(tidyverse)
# Reading the CSV
timber_df = read_csv(
"/Users/esaulnier/Dropbox (University of Oregon)/My Mac (MacBook-Pro.local)/Downloads/Timber_Harvest_Data_1962-2019.csv"
)
There are lots of hack-y ways to deal with this that involve manually typing in file paths, but R-Studio projects and the here package are the best way to deal with this problem. Let's take a quick tangent to discuss these before getting into the tidyverse.
Projects make it easy to manage data and code files that are being used together. For example, you could have a separate project for each of the problem sets in this class, or just one project for all of the material in this class. If you have not created a project for this class, let's make one now.
To create a new project, click File > New Project in R-Studio.
This will launch a pop-up that asks you whether you want to use a “New Directory”, “Existing Directory”, or “Version Control”. If the folder (“Directory”) that you want to create a project for already exists, click “Existing Directory” and then find the folder you want to use. Otherwise create a new directory, call it something like ec421-w22.
Once you have created and opened the project, R-Studio tells you what project you have open in the upper right hand corner of the screen.
Note: You can click the triangle next to the project name to easily switch between projects or create new ones.
So why go to all this trouble? There are lots of benefits, but the main one is that when you have a project open, R-Studio automatically sets the project’s directory as the working directory. What is a working directory? It’s a file path on your computer that is used as the default when reading files into R.
# Checking the working directory
getwd()
## [1] "/Users/esaulnier/Dropbox (University of Oregon)/My Mac (MacBook-Pro.local)/Documents/ge-assignments/EC421-Lab-W22/02-tidyverse"
# Showing the files available in this directory
list.files()
## [1] "figures" "lab-02-tidyverse_files"
## [3] "lab-02-tidyverse.html" "lab-02-tidyverse.md"
## [5] "lab-02-tidyverse.Rmd" "Timber_Harvest_Data_1962-2019.csv"
Note: R-Markdown files automatically set the working directory to the one that the file is saved in.
You can see that my working directory is the folder I have created for this lab, EC-421-Lab-W22/02-tidyverse, and all the files related to this lab are in that directory, including the Timber_Harvest_Data_1962-2019.csv file we want to load! Copy the CSV file into your project directory, and then you can read it using the following code.
# Reading csv
timber_df = read_csv("Timber_Harvest_Data_1962-2019.csv")
But wait, this still won’t work if I try to run it outside of the R-Markdown file I am writing these notes in. That’s where the here package swoops in to save the day.
The here package is extraordinarily useful and quite simple.
All you have to do is use the here() function to tell R where to look relative to the project's root directory. All the here() function does is spit out the entire file path for the file you specify, without you having to manually type it out.
# Loading package
p_load(here, janitor)
# What does here() do?
here("02-tidyverse/Timber_Harvest_Data_1962-2019.csv")
## [1] "/Users/esaulnier/Dropbox (University of Oregon)/My Mac (MacBook-Pro.local)/Documents/ge-assignments/EC421-Lab-W22/02-tidyverse/Timber_Harvest_Data_1962-2019.csv"
# Reading the csv!
timber_df_raw = read_csv(here("02-tidyverse/Timber_Harvest_Data_1962-2019.csv")) |>
clean_names() # Removes spaces and capital letters from column names
I don’t like having spaces or capital letters in my column names, which the janitor::clean_names() function takes care of easily. Now that we’ve got our data loaded, let's see what is in there using some tools from last week.
Note: You can reference which package a function comes from using ::, e.g. package::function().
# Exploring the data a bit
head(timber_df_raw)
## # A tibble: 6 × 12
## year county blm usfs state county_and_municipal industry nip
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1962 BAKER 5858 58900 0 0 0 6540
## 2 1962 BENTON 33001 6100 4376 0 22830 21793
## 3 1962 CLACKAMAS 38314 208600 175 0 108556 11393
## 4 1962 CLATSOP 0 0 24340 0 206050 6049
## 5 1962 COLUMBIA 0 0 0 0 15486 21183
## 6 1962 COOS 128475 33700 25910 0 307841 27053
## # … with 4 more variables: native_american <dbl>, total_of_volume <dbl>,
## # total_private <dbl>, total_public <dbl>
summary(timber_df_raw)
## year county blm usfs
## Min. :1962 Length:2088 Min. : 0 Min. : 0
## 1st Qu.:1976 Class :character 1st Qu.: 0 1st Qu.: 746
## Median :1990 Mode :character Median : 303 Median : 13271
## Mean :1990 Mean : 17218 Mean : 49869
## 3rd Qu.:2005 3rd Qu.: 10972 3rd Qu.: 56470
## Max. :2019 Max. :552354 Max. :857200
## NA's :5
## state county_and_municipal industry nip
## Min. : 0 Min. : 0.0 Min. : 0 Min. : 0
## 1st Qu.: 0 1st Qu.: 0.0 1st Qu.: 5586 1st Qu.: 1854
## Median : 0 Median : 0.0 Median : 30276 Median : 7380
## Mean : 6179 Mean : 929.9 Mean : 80365 Mean : 13136
## 3rd Qu.: 4857 3rd Qu.: 376.0 3rd Qu.:101001 3rd Qu.: 17318
## Max. :123712 Max. :21878.0 Max. :802104 Max. :316492
## NA's :76 NA's :3
## native_american total_of_volume total_private total_public
## Min. : 0 Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 0 1st Qu.: 35358 1st Qu.: 12238 1st Qu.: 7247
## Median : 0 Median : 93184 Median : 46195 Median : 29078
## Mean : 2285 Mean : 169790 Mean : 95747 Mean : 74042
## 3rd Qu.: 0 3rd Qu.: 192265 3rd Qu.:118043 3rd Qu.: 78847
## Max. :103149 Max. :1856429 Max. :862399 Max. :1086607
## NA's :17
Looks like each row is for a county in a particular year and columns are timber harvest volumes for different categories of land owners. The data structure we loaded is called a “tibble”, which is really just the tidyverse’s version of a fancy data frame.
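Since a tibble really is just a data frame underneath, you can convert between the two and check the classes yourself. A quick illustration with made-up data (not the timber data):

```r
library(tibble)

# A regular data frame and its tibble equivalent
df  <- data.frame(x = 1:3, y = c("a", "b", "c"))
tbl <- as_tibble(df)

class(df)   # "data.frame"
class(tbl)  # "tbl_df" "tbl" "data.frame" -- still a data frame underneath
```

The main practical differences are nicer printing (only the first 10 rows, with column types shown) and stricter subsetting behavior.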
Q: How many rows are in our data?
So why is the tidyverse called the tidyverse? It is centered around the idea of “tidy” data, which is data stored under three rules:

1. Each variable has its own column.
2. Each observation has its own row.
3. Each value has its own cell.

Generally this means that data is “long” rather than “wide”. Is our timber_df tidy? NO, each row has many columns, each containing a value for timber harvest volume. We can create a new variable, land_owner, and then a single column, timber_volume, with all of the values for harvest volume in it using the tidyr::pivot_longer function.
# Tidying the data
timber_df =
timber_df_raw |>
pivot_longer(
cols = 3:12, # Columns we want to pivot
names_to = "land_owner", # column name for labels
values_to = "timber_volume" # column name for values
)
# Looking at the tidy data
head(timber_df, n = 10)
## # A tibble: 10 × 4
## year county land_owner timber_volume
## <dbl> <chr> <chr> <dbl>
## 1 1962 BAKER blm 5858
## 2 1962 BAKER usfs 58900
## 3 1962 BAKER state 0
## 4 1962 BAKER county_and_municipal 0
## 5 1962 BAKER industry 0
## 6 1962 BAKER nip 6540
## 7 1962 BAKER native_american 0
## 8 1962 BAKER total_of_volume 71298
## 9 1962 BAKER total_private 6540
## 10 1962 BAKER total_public 64758
Now we have a much nicer structure that is easy to work with!
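For reference, pivot_longer() has an inverse, tidyr::pivot_wider(), which spreads a label/value pair back out into separate columns. A minimal sketch with made-up data (not the timber data):

```r
library(tidyr)

# A tiny long-format tibble, analogous in shape to our tidied timber data
long_df <- tibble::tibble(
  year  = c(2000, 2000, 2001, 2001),
  owner = c("blm", "usfs", "blm", "usfs"),
  vol   = c(10, 20, 30, 40)
)

# pivot_wider() undoes pivot_longer(): each label becomes its own column
wide_df <- long_df |>
  pivot_wider(names_from = owner, values_from = vol)

wide_df  # 2 rows, with columns year, blm, usfs
```

This is handy when you need wide data for a table or for a function that expects one column per series.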
Q: How many rows are in our data now?
Last week we learned that we could select columns using df$col_name or columns and rows with syntax like df[i,j]. But the dplyr package has two functions that are much more powerful for doing this: select and filter.
# Selecting Columns
timber_df |> select(year, county)
## # A tibble: 20,880 × 2
## year county
## <dbl> <chr>
## 1 1962 BAKER
## 2 1962 BAKER
## 3 1962 BAKER
## 4 1962 BAKER
## 5 1962 BAKER
## 6 1962 BAKER
## 7 1962 BAKER
## 8 1962 BAKER
## 9 1962 BAKER
## 10 1962 BAKER
## # … with 20,870 more rows
# Filtering rows using logical expressions
timber_df |> filter(year == 2019 & land_owner == "industry")
## # A tibble: 36 × 4
## year county land_owner timber_volume
## <dbl> <chr> <chr> <dbl>
## 1 2019 BENTON industry 71926
## 2 2019 CLACKAMAS industry 74338
## 3 2019 CLATSOP industry 135899
## 4 2019 COLUMBIA industry 116311
## 5 2019 COOS industry 116909
## 6 2019 CURRY industry 58282
## 7 2019 DOUGLAS industry 547271
## 8 2019 HOOD RIVER industry 9649
## 9 2019 JACKSON industry 54263
## 10 2019 JOSEPHINE industry 15494
## # … with 26 more rows
Q: What was the volume of timber harvested on USFS land in Lane County in 2005?
Note: There are many ways to specify which columns we want to select, but often the above syntax is the easiest.
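For instance, here are a few of those alternatives, shown on a tiny hypothetical stand-in for timber_df (the helpers like starts_with() come with dplyr):

```r
library(dplyr)

# A one-row stand-in for timber_df, just to illustrate the syntax
toy_df <- tibble::tibble(
  year = 1962, county = "BAKER",
  land_owner = "blm", timber_volume = 5858
)

toy_df |> select(year, county)        # by name (as above)
toy_df |> select(1:2)                 # by position
toy_df |> select(-timber_volume)      # everything except one column
toy_df |> select(starts_with("l"))    # names starting with "l" (land_owner)
```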
We can sort data using dplyr::arrange(). Let's see which year had the highest total timber harvest on public land in Lane County. By default, arrange sorts numbers in ascending order, so we can tell it to sort in descending order using desc().
# Years with most harvest on public land
timber_df |>
filter(county == "LANE" & land_owner == "total_public") |>
arrange(desc(timber_volume))
## # A tibble: 58 × 4
## year county land_owner timber_volume
## <dbl> <chr> <chr> <dbl>
## 1 1972 LANE total_public 1086607
## 2 1965 LANE total_public 1071688
## 3 1964 LANE total_public 1064266
## 4 1973 LANE total_public 1058080
## 5 1963 LANE total_public 987009
## 6 1969 LANE total_public 964206
## 7 1968 LANE total_public 927865
## 8 1962 LANE total_public 904893
## 9 1988 LANE total_public 900308
## 10 1966 LANE total_public 857397
## # … with 48 more rows
One of the more powerful functions here is dplyr::mutate(), which allows us to create new columns. These new columns can be functions of existing columns (or even functions of columns you’ve created in the same mutate function call). Remember to assign the output to a name if you want to save the new column(s) you created.
# Creating new variables
timber_df =
timber_df |>
mutate(
injunction = year >= 1990,
injunction_federal = injunction & land_owner %in% c("blm","usfs","total_public")
)
# Showing the df with new columns
timber_df |> filter(year %in% 1989:1990 & county == "BAKER")
## # A tibble: 20 × 6
## year county land_owner timber_volume injunction injunction_federal
## <dbl> <chr> <chr> <dbl> <lgl> <lgl>
## 1 1989 BAKER blm 3198 FALSE FALSE
## 2 1989 BAKER usfs 47525 FALSE FALSE
## 3 1989 BAKER state 0 FALSE FALSE
## 4 1989 BAKER county_and_municipal 0 FALSE FALSE
## 5 1989 BAKER industry 5084 FALSE FALSE
## 6 1989 BAKER nip 14242 FALSE FALSE
## 7 1989 BAKER native_american 0 FALSE FALSE
## 8 1989 BAKER total_of_volume 70049 FALSE FALSE
## 9 1989 BAKER total_private 19326 FALSE FALSE
## 10 1989 BAKER total_public 50723 FALSE FALSE
## 11 1990 BAKER blm 5738 TRUE TRUE
## 12 1990 BAKER usfs 49229 TRUE TRUE
## 13 1990 BAKER state 0 TRUE FALSE
## 14 1990 BAKER county_and_municipal 0 TRUE FALSE
## 15 1990 BAKER industry 9423 TRUE FALSE
## 16 1990 BAKER nip 22193 TRUE FALSE
## 17 1990 BAKER native_american 0 TRUE FALSE
## 18 1990 BAKER total_of_volume 86583 TRUE FALSE
## 19 1990 BAKER total_private 31616 TRUE FALSE
## 20 1990 BAKER total_public 54967 TRUE TRUE
Here I created a logical column for 1990 and later years, when there was an injunction placed on new logging on federal land. The Timber Wars Podcast does a great job explaining why that happened. I also created a column, injunction_federal, that tells us which years and land owners were affected by the injunction. Notice that I used the newly created injunction column to do this.
Note: The %in% operator is a logical operator that I forgot to discuss last week but is very useful. It checks whether the object on the LHS is in the vector on the RHS (e.g. 1 %in% 1:10 returns TRUE, but 1 %in% 2:10 returns FALSE).
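A couple of quick demonstrations of %in% (plain base R):

```r
# %in% asks: is each element on the LHS contained in the RHS vector?
1 %in% 1:10    # TRUE
1 %in% 2:10    # FALSE

# It is vectorized over the LHS, which is what makes it so handy in filter()
c("blm", "usfs", "state") %in% c("blm", "usfs")  # TRUE TRUE FALSE
```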
We can perform aggregations easily using the dplyr::summarize() function. For example, we can calculate a bunch of different statistics for harvest on BLM land in 2019.
# Harvest on BLM land in 2019
timber_df |>
filter(year == 2019 & land_owner == "blm") |>
summarize(
tot_harvest = sum(timber_volume),
avg_harvest = mean(timber_volume),
min_harvest = min(timber_volume),
max_harvest = max(timber_volume),
num_rows = n() # n() is a useful helper function that tells you the number of rows
)
## # A tibble: 1 × 5
## tot_harvest avg_harvest min_harvest max_harvest num_rows
## <dbl> <dbl> <dbl> <dbl> <int>
## 1 143372 3983. 0 31656 36
These aggregations can also be done by group using the dplyr::group_by() function. For example, we can do the same calculations as above, but for all years rather than just 2019. Now instead of doing aggregations across the entire dataset, the aggregations are done at the group level, with the result being the same number of rows as there are groups in the data.
# Harvest on BLM land
timber_df |>
filter(land_owner == "blm") |>
group_by(year) |>
summarize(
tot_harvest = sum(timber_volume),
avg_harvest = mean(timber_volume),
min_harvest = min(timber_volume),
max_harvest = max(timber_volume),
num_rows = n()
) |>
arrange(desc(year))
## # A tibble: 58 × 6
## year tot_harvest avg_harvest min_harvest max_harvest num_rows
## <dbl> <dbl> <dbl> <dbl> <dbl> <int>
## 1 2019 143372 3983. 0 31656 36
## 2 2018 158126 4392. 0 34574 36
## 3 2017 160390 4455. 0 35693 36
## 4 2016 182462 5068. 0 40424 36
## 5 2015 193275 5369. 0 45504 36
## 6 2014 208826 5801. 0 45162 36
## 7 2013 165116 4587. 0 37232 36
## 8 2012 148216 4117. 0 42048 36
## 9 2011 164954 4582. 0 48315 36
## 10 2010 132857 3690. 0 39897 36
## # … with 48 more rows
Q: What was the total timber harvest for the whole state of Oregon on Non-Industrial Private (“nip”) land before and after 2000? Hint: Create a new variable that is an indicator for pre/post 2000, and then group by that variable.
Now that we know how to manipulate data, let's get on to the fun stuff: visualization!
ggplot2 is another package in the tidyverse that implements the “grammar of graphics” to create a very elegant data visualization tool. If you want to read more about visualization in ggplot2, check out the Data Visualization chapter of R for Data Science. There are three basic inputs you need to create a plot using the ggplot function: Data, an aesthetic mapping, and layers.
Aesthetics are some visual property of your plot: The most common ones are going to be x axis, y axis, size, color, fill, and shape. We have some experience with these from plotting in Base R last week. You specify what you want the aesthetics to be using aes().
# Harvest in Lane County for each land owner by year
timber_df |>
filter(county == "LANE") |>
ggplot(aes(x = year, y = timber_volume)) + # Aesthetic mapping
geom_point() # Adds a point layer
These plots are built in layers, meaning that once you specify your base plot using ggplot() you can add any number of layers on top of it using +. These layers are different types of visualizations: points, lines, histograms, bars, etc. The aesthetics are passed down to each layer by default, but you can also specify aesthetics within a layer.
The above plot is confusing because each point is a different land owner, and this is time series data, so a line connecting annual observations is more appropriate. Let's give each land owner its own color and connect the points using geom_line().
# Harvest on USFS land by year
timber_df |>
filter(county == "LANE") |>
ggplot(aes(x = year, y = timber_volume)) +
geom_point(color = "gray")+
geom_line(aes(color = land_owner)) # Adds a line layer colored by land owner
If you want to change the look of one of the layers for the entire graph, as I did with the point color above, then specify it outside of the aes() function.
Q: What would happen if I wrote geom_point(aes(color = land_owner)) instead?
Additional layers can be added to change the color scales and visual attributes of the plot. Let’s focus just on the total private and public land, but summarize across the entire state.
# Plot of total private and public harvest by year
timber_df |>
filter(land_owner %in% c("total_private","total_public")) |>
group_by(year, land_owner) |>
summarize(tot_harvest = sum(timber_volume)) |>
ggplot(aes(x = year, y = tot_harvest/1e6, color = land_owner)) +
geom_line(size = 1.5) +
geom_vline(xintercept = 1990, linetype = "dashed") +
labs(
title = "OR Public vs Private Timber Harvest",
x = "Year",
y = "Timber Volume (Million Boardfeet)"
) +
scale_color_manual(
name = "Land Owner", # Name of the color legend
values = c("#5f6880","#ec7662"), # Hex codes for colors we want
labels = c("Private", "Public") # Names for color labels
) +
scale_x_continuous(
breaks = seq(1960,2020, by = 10)
) +
theme_classic()
There is a lot going on in this plot! Let’s break it down step by step:
- We manipulated the data using the dplyr functions we learned earlier
- We set up the base plot with ggplot(), specifying the aesthetics: year on the x-axis, tot_harvest on the y-axis (scaled down by 1 million), and land_owner as the color
- We added geom_line() to draw the lines for the data and increased the thickness of the lines with the size argument
- We added a dashed vertical line with geom_vline() to signify the injunction
- We added a title and axis labels with labs()
- We set the colors with scale_color_manual(). I chose two specific colors I wanted, but there are lots of built-in color palettes that you can use (e.g. scale_color_viridis())
- We applied theme_classic(): this changes things like the background, font/size of labels, axis lines, etc. All of these settings can be changed manually if you want to!

We’ll talk about data visualization using ggplot2 a lot over the course of this term. You can do so much more than what I’ve shown you today: density plots, maps, animations, 3D, and much more.
Q: Make a plot that shows me average timber harvest on public vs private land before and after 1990. Make it look good with nice labels! Hint: geom_col() makes bar charts, and the fill color is altered with the fill aesthetic. If you use geom_col() (you don’t have to), it would be nice to set position = "dodge" inside it so that the bars sit next to each other rather than on top of each other.